Multiple regression is one of the most popular methods in statistics as well as in machine learning. We use linear models as working models for their simplicity and interpretability. It is important to use domain knowledge as much as we can to determine the form of the response as well as the functional form of the factors. When there are many possible features to include in the working model, it is inevitable that we must choose the best possible model under a sensible criterion. Cp, BIC and regularizations such as LASSO are introduced. Be aware that whether model selection is done formally or informally, the inferences obtained from the final lm() fit may not be valid; some adjustment will be needed. This last step is beyond the scope of this class. Check the current research line that Linda and collaborators are working on.
This homework consists of two parts. The first is an exercise (it will feel like a toy example after the COVID case study) to get familiar with model selection tools such as Cp and BIC. The main job is a rather involved case study of the devastating COVID-19 pandemic. Please read through the case study first. This project would make a great entry on your CV.
For the COVID case study, most of the time and effort will go into the EDA portion.
Model building process
Methods
Understand the criteria
Cp
BIC
K-fold Cross Validation
LASSO
Packages
lm(), anova()
regsubsets()
glmnet() & cv.glmnet()
Review the code and concepts covered during lectures: multiple regression, model selection and penalized regression through elastic net.
This case was provided in our Homework 2 submission.
ISLR::Auto data
This will be the last part of the Auto data from ISLR. The original
data contains 408 observations about cars. It has some similarity to the
Cars data that we use in our lectures. To get the data, first install
the package ISLR; the data set Auto should then be
loaded automatically. We use this case to go through the methods learned so
far.
Final modelling question: We want to explore the effects of each feature as best as possible.
Let GPM = 1/MPG. Compare the residual plots with MPG and GPM as
responses and see which one yields more satisfactory patterns.

The residual plots indicate that using GPM as the response is more appropriate. The residual plot of GPM (on the right) is closer to a horizontal line, which indicates the relationship is more linear; using GPM as the response meets the regression assumptions. The residual plot of MPG suggests a non-linear relationship.
In addition, can you provide some background knowledge to support the
notion: it makes more sense to model GPM?
GPM measures fuel consumed per unit of distance, whereas MPG measures distance per unit of fuel consumed. GPM reflects the actual difference in fuel consumption for the same distance driven.
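The comparison can be sketched as follows, assuming the ISLR package is installed (the predictors used here are an illustrative subset, not our full model):

```r
# Sketch only: assumes ISLR is installed; column names follow ISLR::Auto.
library(ISLR)
auto <- na.omit(Auto)
auto$gpm <- 1 / auto$mpg                    # gallons per mile

fit_mpg <- lm(mpg ~ horsepower + weight + acceleration + year, data = auto)
fit_gpm <- lm(gpm ~ horsepower + weight + acceleration + year, data = auto)

par(mfrow = c(1, 2))
plot(fit_mpg, which = 1)                    # curved residual pattern
plot(fit_gpm, which = 1)                    # closer to a flat band
```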
## Analysis of Variance Table
##
## Model 1: gpm ~ cylinders + displacement + horsepower + weight + acceleration +
## year + origin
## Model 2: gpm ~ cylinders + displacement + horsepower + weight + acceleration +
## year + origin + horsepower * origin
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 383 0.0122
## 2 381 0.0122 2 7.2e-05 1.13 0.33
Because origin might affect horsepower, we checked the interaction between origin and horsepower. The p-value of the F statistic is 0.3256, so we fail to reject \(H_0\): there is no significant difference between the models with and without the interaction term. That is, the interaction of horsepower and origin is not significant, and we do not need to include it in our model.
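A minimal sketch of this kind of nested-model F-test, using simulated data in place of the Auto set (the coefficients are illustrative):

```r
# Simulated stand-in for the Auto data: gpm depends on horsepower only,
# so the horsepower:origin interaction should test as non-significant.
set.seed(1)
n <- 200
horsepower <- rnorm(n, 100, 20)
origin <- factor(sample(1:3, n, replace = TRUE))
gpm <- 0.02 + 0.0001 * horsepower + rnorm(n, sd = 0.005)

fit0 <- lm(gpm ~ horsepower + origin)       # no interaction
fit1 <- lm(gpm ~ horsepower * origin)       # adds horsepower:origin
anova(fit0, fit1)                           # F-test on the 2 extra terms
```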
## Subset selection object
## Call: regsubsets.formula(gpm ~ ., data = Auto_gpm, nvmax = 10, method = "exhaustive")
## 8 Variables (and intercept)
## Forced in Forced out
## cylinders FALSE FALSE
## displacement FALSE FALSE
## horsepower FALSE FALSE
## weight FALSE FALSE
## acceleration FALSE FALSE
## year FALSE FALSE
## origin2 FALSE FALSE
## origin3 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## cylinders displacement horsepower weight acceleration year origin2
## 1 ( 1 ) " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " "*" " " "*" " "
## 3 ( 1 ) " " " " "*" "*" " " "*" " "
## 4 ( 1 ) " " " " "*" "*" " " "*" "*"
## 5 ( 1 ) " " " " "*" "*" "*" "*" "*"
## 6 ( 1 ) "*" " " "*" "*" "*" "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## origin3
## 1 ( 1 ) " "
## 2 ( 1 ) " "
## 3 ( 1 ) " "
## 4 ( 1 ) " "
## 5 ( 1 ) " "
## 6 ( 1 ) " "
## 7 ( 1 ) "*"
## 8 ( 1 ) "*"
## [1] 345.2 44.0 18.9 16.1 14.1 13.2 11.2 9.0
Using Mallow’s \(C_p\), we can see the model with all 8 variables has the smallest \(C_p\) and thus the smallest estimated prediction error.
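The \(C_p\) values above can be reproduced by hand; here is a base-R sketch on simulated data (x1, x2, x3 are illustrative names, not the Auto variables; regsubsets() in the leaps package automates this search):

```r
# Mallow's Cp computed by hand: Cp = RSS/sigma2 - n + 2p, with sigma2
# estimated from the full model. By construction, Cp(full model) = p.
set.seed(2)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)      # x3 is pure noise

full   <- lm(y ~ x1 + x2 + x3)
sigma2 <- summary(full)$sigma^2             # error variance from the full model

cp <- function(fit) {
  p <- length(coef(fit))                    # parameters incl. intercept
  sum(resid(fit)^2) / sigma2 + 2 * p - n
}
sapply(list(lm(y ~ x1), lm(y ~ x1 + x2), full), cp)
```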
## [1] "(Intercept)" "horsepower" "weight" "acceleration" "year"
## [6] "origin2"
## (Intercept) horsepower weight acceleration year origin2
## 9.66e-02 1.08e-04 1.16e-05 3.30e-04 -1.31e-03 -1.87e-03
Since the \(C_p\) values of the larger models are similar in scale, we choose the final model with 5 variables; a small \(C_p\) value indicates a more accurate model. The final model is: \[gpm = 0.09662 + 0.0001081 \cdot horsepower + 0.00001155 \cdot weight + 0.00033 \cdot acceleration - 0.001307 \cdot year - 0.00187 \cdot origin2\]
From plot 1.2, we can see the RSS is already quite small with 5 variables.
Plots 1.3 and 1.4 indicate the final model meets the linearity and homoscedasticity assumptions.
summary(final_fit)
##
## Call:
## lm(formula = gpm ~ horsepower + weight + acceleration + year +
## origin2, data = Auto_gpm1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.016142 -0.003487 -0.000249 0.002937 0.024335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.66e-02 7.91e-03 12.22 < 2e-16 ***
## horsepower 1.08e-04 2.23e-05 4.85 1.8e-06 ***
## weight 1.16e-05 7.88e-07 14.66 < 2e-16 ***
## acceleration 3.30e-04 1.67e-04 1.98 0.048 *
## year -1.31e-03 8.82e-05 -14.81 < 2e-16 ***
## origin21 -1.87e-03 8.13e-04 -2.30 0.022 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00572 on 386 degrees of freedom
## Multiple R-squared: 0.884, Adjusted R-squared: 0.882
## F-statistic: 586 on 5 and 386 DF, p-value: <2e-16

\(R^2\) is 0.8835 with 5 variables, which suggests the final model fits the data well.
The p-values of all selected variables are significant. Interpretation of \(\hat{\beta_i}\): taking horsepower as an example, holding the other variables constant, gpm is expected to increase by 0.0001081 for each one-unit increase in horsepower.
Predict the mpg of a car that is: built in 1983, in the
US, red, 180 inches long, 8 cylinders, 350 displacement, 260
horsepower, and weighs 4,000 pounds. Give a 95% CI.

The predicted mpg for this car is 14.8, with a 95% CI of (13.6, 16.2).
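Such intervals come from predict(); a minimal base-R sketch on simulated data (for the GPM model one would back-transform via mpg = 1/gpm, where the interval endpoints swap order):

```r
# Simulated one-predictor example of an interval from predict();
# use interval = "confidence" for a CI on the mean response.
set.seed(3)
d <- data.frame(x = rnorm(50))
d$y <- 2 + 3 * d$x + rnorm(50)
fit <- lm(y ~ x, data = d)

# 95% prediction interval for a new observation at x = 1
pred <- predict(fit, newdata = data.frame(x = 1),
                interval = "prediction", level = 0.95)
pred   # columns: fit, lwr, upr
```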
## [1] 392 8
The sample size of the study is 392. More observations could be collected if possible. The study could also include more variables, such as type of transmission.
The outbreak of the novel coronavirus disease 2019 (COVID-19) was declared a public health emergency of international concern by the World Health Organization (WHO) on January 30, 2020. Upwards of 755 million cases have been confirmed worldwide, with nearly 6.8 million associated deaths by Feb of 2023. Within the US alone, there have been over 1.1 million deaths and upwards of 102 million cases reported by Feb of 2023. Governments around the world have implemented and suggested a number of policies to lessen the spread of the pandemic, including mask-wearing requirements, travel restrictions, business and school closures, and even stay-at-home orders. The global pandemic has impacted the lives of individuals in countless ways, and though many countries have begun vaccinating individuals, the long-term impact of the virus remains unclear.
The impact of COVID-19 on a given segment of the population appears to vary drastically based on the socioeconomic characteristics of the segment. In particular, differing rates of infection and fatalities have been reported among different racial groups, age groups, and socioeconomic groups. One of the most important metrics for determining the impact of the pandemic is the death rate, which is the proportion of people within the total population who die due to the disease.
We assemble this dataset for our research with the goal to investigate the effectiveness of lockdown on flattening the COVID curve. We provide a portion of the cleaned dataset for this case study.
There are two main goals for this case study.
Remark 1: The data and the statistics reported here were collected before February of 2021.
Remark 2: A group of RAs spent a tremendous amount of time working together to assemble the data; it required substantial data wrangling skills.
Remark 3: Please keep track of the most updated version of this write-up.
The data comes from several different sources:
In this case study, we use the following three nearly cleaned data sets:
Among all data, the unique identifier of county is
FIPS.
The cleaning procedure is attached in
Appendix 2: Data cleaning. You may go through it if you are
interested or would like to make any changes.
It may need more data wrangling.
First read in the data.
The detailed description of the variables is in
Appendix 1: Data description. Please get familiar with the
variables. Summarize the two data sets briefly.
The dataframe “county_data” contains socioeconomic and demographic information, in different years, for each of the 3278 counties in the United States, which includes the counties among the 50 states and those in Puerto Rico. While it would not be possible to summarize the entirety of the socioeconomic dataset, a selection of variables is presented here. The median per capita income across all counties in the US is $26,720, ranging from $5,974-$72,832. The median county population size as of 2019 was 26,700, ranging from 86-328,000,000. The median percentage of residents per county who were not born in the US is 2.8%, ranging from 0-53.3%. The median county-level unemployment rate as of 2019 was 3.7%, ranging from 0.7-19.3%.
The dataframe “covid_rate” contains cumulative statistics for COVID cases and deaths across 397 days (start date: 2020-01-21, end date: 2021-02-20) among the counties. The median cumulative cases by county during this time period was 366, ranging from a minimum of 0 at the start of the period to a maximum of 1,179,633 at the end of the period. By scanning the data by state, we can see that individual states experienced varying numbers of COVID cases and deaths.
It is crucial to decide the right granularity for visualization and analysis. We will compare daily vs. weekly total new cases by state, and we will see that daily reports are hard to interpret.
As we can see in Figure 1, it is nearly impossible to interpret more than a year’s worth of data using daily values because there are too many data points. Although we can see slight variations in “clumps” of higher daily rates versus lower daily rates, overall it is not feasible to interpret data at this granularity. We can also see a few apparent outliers, but it is challenging to draw any meaningful conclusions. Additionally, without controlling for the population size of the state, we can’t be sure whether the number of new cases is large or small relative to the population.
On a more practical note, COVID testing result release practices varied by county which could bias any daily-level results. For example, while most counties collected and processed COVID tests 7 days a week, aggregated results were often only released on business days. This resulted in apparent population-level spikes in new cases on Mondays, when in reality this was due to the fact that test results from Saturday, Sunday, and Monday were all released on Monday.
Create weekly_case_per100k and plot spaghetti plots of
weekly_case_per100k by state. Use
TotalPopEst2019 as the population.

We can see in Figures 2a-2d that each state in the US experienced slightly different patterns regarding COVID-19 cases. One of the more noticeable differences is the timing of case peaks. For example, New York appears to have one of the earliest peaks, perhaps driven by the high population density of New York City (NYC) and the presence of three major international airports within NYC limits. In contrast, Pennsylvania’s largest peak did not arrive until closer to 2021. It is also evident that, despite normalizing COVID-19 cases per 100k population, there are drastic differences in cases by state. Population density is likely a major driver of this difference; for example, California has a high number of weekly cases whereas Wyoming has very few. Population density cannot explain all of the variation, however, as some states with relatively high density, such as Connecticut and Pennsylvania, have low numbers overall. It is possible that some differences are also driven by state- and county-level policies which restricted the flow of people and therefore the spread of COVID-19.
Use covid_intervention to examine the effectiveness of lockdown in flattening the curve. (Hints: use the limits argument in scale_fill_gradient() or use facet_wrap(); use lubridate::month() and lubridate::year() to extract month and year from date; use tidyr::complete(state, month, fill = list(new_case_per100k = NA)) to complete the missing months with no cases.)

Use plotly to animate the monthly maps in i). Does it reveal any systematic way to capture the dynamic changes among states? (Hints: follow Appendix 3: Plotly heatmap in the Module 6 regularization lecture to plot the heatmap using plotly. Use the frame argument in add_trace() for animation. plotly only recognizes abbreviations of state names; use unique(us_map(regions = "states") %>% select(abbr, full)) to get the abbreviations and merge them with the data.)

We now try to build a good parsimonious model to find possible factors related to the death rate at the county level. Let us not take the time series into account for the moment and use the total number as of Feb 1, 2021.
Create the response variable total_death_per100k as
the total number of COVID deaths per 100k by Feb 1, 2021. We
suggest taking the log transformation:
log_total_death_per100k = log(total_death_per100k + 1).
Merge total_death_per100k into county_data for
the following analysis.
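A toy base-R sketch of this construction and merge (the column names cum_deaths and pop are hypothetical; FIPS is the county identifier named above):

```r
# Toy versions of covid_rate and county_data; cum_deaths and pop are
# assumed column names for illustration only.
covid_rate  <- data.frame(FIPS = c(1, 2), cum_deaths = c(50, 0),
                          pop = c(1e5, 5e4))
county_data <- data.frame(FIPS = c(1, 2), PovertyAllAgesPct = c(12, 20))

# Deaths per 100k, then the suggested log(x + 1) transformation
covid_rate$total_death_per100k <- covid_rate$cum_deaths / covid_rate$pop * 1e5
covid_rate$log_total_death_per100k <- log(covid_rate$total_death_per100k + 1)

# Merge the response into the county-level covariates by FIPS
covid_final <- merge(county_data,
                     covid_rate[, c("FIPS", "log_total_death_per100k")],
                     by = "FIPS")
```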
Select possible variables in county_data as
covariates. We provide county_data_sub, a subset of variables
from county_data, to get you started. Please add any
potential variables as you wish.
Our final analysis data set has the following counts of missing values per variable:
State: None
log_total_death_per100k: 218
Deep_Pov_all: 53
PovertyAllAgesPct: 133
PerCapitaInc: 53
UnempRate2019: 54
PctEmpFIRE: 54
PctEmpConstruction: 54
PctEmpTrans: 54
PctEmpMining: 54
PctEmpTrade: 54
PctEmpInformation: 54
PctEmpAgriculture: 54
PctEmpManufacturing: 54
PctEmpServices: 54
PopDensity2010: 53
TotalOccHU: 54
AvgHHSize: 53
OwnHomePct: 53
Age65AndOlderPct2010: 53
TotalPop25Plus: 53
Under18Pct2010: 53
Ed1LessThanHSPct: 53
Ed2HSDiplomaOnlyPct: 53
Ed3SomeCollegePct: 53
Ed4AssocDegreePct: 53
Ed5CollegePlusPct: 53
ForeignBornPct: 132
Net_International_Migration_Rate_2010_2019: 132
NetMigrationRate1019: 132
NaturalChangeRate1019: 132
TotalPopEst2019: 53
WhiteNonHispanicPct2010: 53
NativeAmericanNonHispanicPct2010: 53
BlackNonHispanicPct2010: 53
AsianNonHispanicPct2010: 53
HispanicPct2010: 53
Type_2015_Update: 183
RuralUrbanContinuumCode2013: 104
UrbanInfluenceCode2013: 104
Hipov: 105
Perpov_1980_0711: 183
HiCreativeClass2000: 187
HiAmenity: 219
Retirement_Destination_2015_Update: 183
Use lambda.1se to choose a smaller model.

Using a lambda of 0.0055, our final LASSO model includes 79 variables, including binary indicators for state. In the LASSO process we forced in State because it is a categorical variable requiring N-1 indicator variables; we do not want LASSO to evaluate those coefficients individually, but rather as a group.
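One way to force the State dummies into the LASSO is a zero penalty factor in glmnet; a hedged sketch, assuming covid_final_subset as described in this write-up:

```r
# Sketch only: assumes the glmnet package and a complete-case data frame
# covid_final_subset with the response log_total_death_per100k.
library(glmnet)
X <- model.matrix(log_total_death_per100k ~ ., data = covid_final_subset)[, -1]
y <- covid_final_subset$log_total_death_per100k

# Zero penalty on the State dummies keeps them in every candidate model
pf <- ifelse(grepl("^State", colnames(X)), 0, 1)

set.seed(10)
cvfit <- cv.glmnet(X, y, penalty.factor = pf)
coef(cvfit, s = "lambda.1se")   # the smaller, 1-SE-rule model
```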
##
## Call:
## lm(formula = log_total_death_per100k ~ ., data = covid_final_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.637 -0.269 0.059 0.374 3.759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.31e+00 6.22e-01 6.93 5.0e-12 ***
## StateArizona 3.76e-01 2.50e-01 1.50 0.13257
## StateArkansas 8.37e-02 1.37e-01 0.61 0.53986
## StateCalifornia -6.17e-01 1.83e-01 -3.38 0.00073 ***
## StateColorado -2.25e-01 1.61e-01 -1.40 0.16156
## StateConnecticut 2.52e-01 3.08e-01 0.82 0.41251
## StateDelaware -1.39e-01 4.69e-01 -0.30 0.76759
## StateDistrict of Columbia 8.40e-01 8.46e-01 0.99 0.32086
## StateFlorida 4.40e-03 1.53e-01 0.03 0.97709
## StateGeorgia 1.02e-01 1.17e-01 0.87 0.38258
## StateIdaho -1.82e-01 1.72e-01 -1.06 0.29016
## StateIllinois 2.69e-01 1.42e-01 1.89 0.05839 .
## StateIndiana -5.31e-02 1.40e-01 -0.38 0.70371
## StateIowa 2.60e-01 1.43e-01 1.82 0.06931 .
## StateKansas -6.36e-01 1.43e-01 -4.45 8.9e-06 ***
## StateKentucky -6.48e-01 1.35e-01 -4.78 1.8e-06 ***
## StateLouisiana 4.31e-01 1.48e-01 2.92 0.00353 **
## StateMaine -1.34e+00 2.29e-01 -5.84 5.7e-09 ***
## StateMaryland 3.58e-02 2.04e-01 0.18 0.86022
## StateMassachusetts 1.13e-01 2.47e-01 0.46 0.64666
## StateMichigan -5.30e-02 1.47e-01 -0.36 0.71902
## StateMinnesota -9.09e-02 1.50e-01 -0.61 0.54468
## StateMississippi 4.69e-01 1.40e-01 3.36 0.00080 ***
## StateMissouri -3.54e-01 1.32e-01 -2.68 0.00741 **
## StateMontana 6.60e-01 1.68e-01 3.94 8.4e-05 ***
## StateNebraska -3.76e-01 1.51e-01 -2.49 0.01279 *
## StateNevada -5.31e-01 2.41e-01 -2.20 0.02804 *
## StateNew Hampshire -7.93e-01 2.77e-01 -2.87 0.00419 **
## StateNew Jersey 3.47e-01 2.21e-01 1.57 0.11681
## StateNew Mexico -4.63e-01 2.04e-01 -2.27 0.02312 *
## StateNew York -3.30e-01 1.61e-01 -2.05 0.04086 *
## StateNorth Carolina -3.54e-01 1.30e-01 -2.72 0.00649 **
## StateNorth Dakota 9.20e-01 1.76e-01 5.24 1.7e-07 ***
## StateOhio -4.61e-01 1.44e-01 -3.19 0.00142 **
## StateOklahoma -4.14e-01 1.43e-01 -2.89 0.00393 **
## StateOregon -7.11e-01 1.83e-01 -3.88 0.00011 ***
## StatePennsylvania 1.08e-01 1.59e-01 0.68 0.49788
## StateRhode Island -1.43e-01 3.75e-01 -0.38 0.70395
## StateSouth Carolina -8.49e-02 1.53e-01 -0.56 0.57852
## StateSouth Dakota 8.14e-01 1.64e-01 4.95 7.8e-07 ***
## StateTennessee 6.35e-02 1.32e-01 0.48 0.63169
## StateTexas 1.75e-01 1.29e-01 1.35 0.17623
## StateUtah -7.90e-01 2.03e-01 -3.89 0.00010 ***
## StateVermont -2.10e+00 2.43e-01 -8.64 < 2e-16 ***
## StateVirginia -5.58e-01 1.33e-01 -4.21 2.6e-05 ***
## StateWashington -6.45e-01 1.84e-01 -3.50 0.00048 ***
## StateWest Virginia -5.81e-01 1.65e-01 -3.51 0.00045 ***
## StateWisconsin -9.90e-02 1.52e-01 -0.65 0.51589
## StateWyoming 4.34e-01 2.13e-01 2.03 0.04200 *
## PovertyAllAgesPct 1.13e-02 6.13e-03 1.84 0.06641 .
## PerCapitaInc -8.97e-06 6.30e-06 -1.42 0.15442
## UnempRate2019 -6.21e-02 1.70e-02 -3.65 0.00027 ***
## PctEmpFIRE 3.13e-02 1.13e-02 2.77 0.00563 **
## PctEmpConstruction -3.42e-02 8.59e-03 -3.98 7.1e-05 ***
## PctEmpMining -5.58e-03 7.18e-03 -0.78 0.43745
## PctEmpTrade 1.13e-02 7.24e-03 1.56 0.11845
## PctEmpInformation -2.58e-02 2.10e-02 -1.23 0.22039
## PctEmpAgriculture -2.97e-02 6.06e-03 -4.89 1.0e-06 ***
## PctEmpManufacturing 8.64e-03 5.12e-03 1.69 0.09204 .
## PctEmpServices 8.64e-03 5.60e-03 1.54 0.12267
## PopDensity2010 1.74e-05 2.99e-05 0.58 0.56050
## AvgHHSize -2.77e-01 1.04e-01 -2.65 0.00802 **
## Age65AndOlderPct2010 4.17e-02 8.43e-03 4.94 8.2e-07 ***
## TotalPop25Plus 1.06e-08 8.47e-08 0.12 0.90077
## Under18Pct2010 6.76e-02 9.25e-03 7.31 3.5e-13 ***
## Ed1LessThanHSPct 8.69e-03 6.05e-03 1.43 0.15149
## Ed3SomeCollegePct -2.18e-02 6.31e-03 -3.46 0.00056 ***
## Ed5CollegePlusPct -1.28e-02 5.09e-03 -2.52 0.01180 *
## ForeignBornPct 5.23e-03 6.45e-03 0.81 0.41788
## NetMigrationRate1019 -1.22e-02 2.75e-03 -4.44 9.4e-06 ***
## NaturalChangeRate1019 -4.51e-02 1.04e-02 -4.33 1.5e-05 ***
## WhiteNonHispanicPct2010 -2.96e-03 1.96e-03 -1.52 0.12984
## NativeAmericanNonHispanicPct2010 7.00e-03 3.45e-03 2.03 0.04267 *
## AsianNonHispanicPct2010 1.19e-02 1.34e-02 0.88 0.37645
## HispanicPct2010 7.79e-03 2.91e-03 2.68 0.00741 **
## Type_2015_Update -7.95e-03 9.06e-03 -0.88 0.38010
## RuralUrbanContinuumCode2013 -3.18e-02 8.89e-03 -3.58 0.00035 ***
## Perpov_1980_0711 -1.67e-01 6.57e-02 -2.54 0.01124 *
## HiCreativeClass2000 -6.54e-02 5.62e-02 -1.17 0.24401
## HiAmenity -1.11e-01 4.91e-02 -2.25 0.02448 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.787 on 2978 degrees of freedom
## Multiple R-squared: 0.373, Adjusted R-squared: 0.357
## F-statistic: 22.5 on 79 and 2978 DF, p-value: <2e-16
Use Cp or BIC to fine-tune the LASSO model from iii).
Again force in State in the process. (You could do
backward elimination to avoid using Cp or BIC.)

Further reducing the model, we now have 60 variables, including the binary indicators for state. All of the variables are significant (excluding some of the individual state indicator variables).
##
## Call:
## lm(formula = log_total_death_per100k ~ ., data = covid_final_subset_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.661 -0.260 0.063 0.383 3.663
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.04403 0.32041 12.62 < 2e-16 ***
## StateArizona 0.33109 0.24158 1.37 0.17063
## StateArkansas 0.08018 0.13446 0.60 0.55104
## StateCalifornia -0.62994 0.16150 -3.90 9.8e-05 ***
## StateColorado -0.30859 0.15221 -2.03 0.04270 *
## StateConnecticut 0.20217 0.30180 0.67 0.50298
## StateDelaware -0.02446 0.46915 -0.05 0.95842
## StateDistrict of Columbia 0.89154 0.80607 1.11 0.26880
## StateFlorida -0.01719 0.14306 -0.12 0.90434
## StateGeorgia 0.09933 0.11638 0.85 0.39348
## StateIdaho -0.28220 0.16413 -1.72 0.08565 .
## StateIllinois 0.27978 0.13312 2.10 0.03565 *
## StateIndiana -0.06188 0.13112 -0.47 0.63699
## StateIowa 0.30414 0.13146 2.31 0.02076 *
## StateKansas -0.60940 0.13500 -4.51 6.6e-06 ***
## StateKentucky -0.69953 0.12486 -5.60 2.3e-08 ***
## StateLouisiana 0.37441 0.14472 2.59 0.00972 **
## StateMaine -1.36003 0.22305 -6.10 1.2e-09 ***
## StateMaryland -0.01223 0.19745 -0.06 0.95063
## StateMassachusetts 0.12661 0.23970 0.53 0.59739
## StateMichigan -0.00827 0.13867 -0.06 0.95243
## StateMinnesota -0.05589 0.13882 -0.40 0.68725
## StateMississippi 0.46876 0.13859 3.38 0.00073 ***
## StateMissouri -0.36579 0.12443 -2.94 0.00331 **
## StateMontana 0.55669 0.15603 3.57 0.00037 ***
## StateNebraska -0.34969 0.14021 -2.49 0.01268 *
## StateNevada -0.73222 0.22781 -3.21 0.00132 **
## StateNew Hampshire -0.80046 0.27216 -2.94 0.00330 **
## StateNew Jersey 0.40566 0.20774 1.95 0.05094 .
## StateNew Mexico -0.58659 0.19202 -3.05 0.00227 **
## StateNew York -0.34909 0.15091 -2.31 0.02078 *
## StateNorth Carolina -0.26834 0.12842 -2.09 0.03674 *
## StateNorth Dakota 0.90489 0.16139 5.61 2.2e-08 ***
## StateOhio -0.47680 0.13494 -3.53 0.00042 ***
## StateOklahoma -0.43524 0.13458 -3.23 0.00123 **
## StateOregon -0.76334 0.17535 -4.35 1.4e-05 ***
## StatePennsylvania 0.09172 0.14813 0.62 0.53582
## StateRhode Island -0.21264 0.37195 -0.57 0.56757
## StateSouth Carolina -0.04028 0.15214 -0.26 0.79120
## StateSouth Dakota 0.86403 0.14974 5.77 8.7e-09 ***
## StateTennessee 0.07012 0.12861 0.55 0.58565
## StateTexas 0.10766 0.12431 0.87 0.38652
## StateUtah -1.07158 0.19034 -5.63 2.0e-08 ***
## StateVermont -2.10799 0.23753 -8.87 < 2e-16 ***
## StateVirginia -0.54264 0.12936 -4.19 2.8e-05 ***
## StateWashington -0.67982 0.17634 -3.86 0.00012 ***
## StateWest Virginia -0.70041 0.15242 -4.60 4.5e-06 ***
## StateWisconsin -0.05521 0.14087 -0.39 0.69515
## StateWyoming 0.18081 0.20183 0.90 0.37040
## PovertyAllAgesPct 0.01873 0.00413 4.53 6.0e-06 ***
## UnempRate2019 -0.06480 0.01660 -3.90 9.7e-05 ***
## PctEmpConstruction -0.04883 0.00687 -7.11 1.4e-12 ***
## PctEmpAgriculture -0.03731 0.00347 -10.76 < 2e-16 ***
## Age65AndOlderPct2010 0.04703 0.00807 5.82 6.3e-09 ***
## Under18Pct2010 0.07043 0.00767 9.19 < 2e-16 ***
## Ed3SomeCollegePct -0.02565 0.00537 -4.78 1.8e-06 ***
## Ed5CollegePlusPct -0.01522 0.00252 -6.04 1.7e-09 ***
## NetMigrationRate1019 -0.01389 0.00258 -5.39 7.5e-08 ***
## NaturalChangeRate1019 -0.04308 0.01004 -4.29 1.8e-05 ***
## HispanicPct2010 0.00925 0.00191 4.84 1.4e-06 ***
## RuralUrbanContinuumCode2013 -0.04244 0.00822 -5.16 2.6e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.792 on 2997 degrees of freedom
## Multiple R-squared: 0.361, Adjusted R-squared: 0.349
## F-statistic: 28.3 on 60 and 2997 DF, p-value: <2e-16
Based on the residual plot and Q-Q plot, it is questionable whether the assumptions for a linear model are met. While the residual plot has a slight cone-shaped pattern, homoscedasticity is probably reasonably met. The Q-Q plot, however, shows a fairly drastic departure from normality, indicating that we may need to transform some of our variables to better meet this assumption.
Our final model did not include the percentage of Black/African American individuals as a predictor, and we are therefore unable to determine the role this plays in COVID-19 deaths. However, in many instances, race is a proxy for other socioeconomic factors, and in this case we may have captured some of the racial disparities through the inclusion of poverty, education, and urbanicity in our model. We can see in our model that for every one-percentage-point increase in the share of Hispanic residents, counties would experience approximately a 0.9% increase in COVID-19 deaths. Similarly, for every one-percentage-point increase in the share of residents 65 and older, counties would experience approximately a 4.4% increase in COVID-19 deaths. In both of these cases, we can say that our analysis supports the argument that the elderly and Hispanic individuals have been disproportionately impacted by COVID.
Based on our final model, we found that, controlling for other important factors, some states experience greater deaths per 100,000 population than others. Our model compares given states to the reference state of Alabama. Interestingly, while some states have significantly different COVID-19 mortality rates after controlling for other factors, other states are not significantly different, implying that for some states, sociodemographic factors are more important than other state-level factors captured by the state variable. We also found that the net migration and natural population change rates from 2010 to 2019 contribute negatively to the COVID-19 death rate (i.e., are associated with fewer deaths). When the natural population change rate increases (meaning more individuals are being born than dying), the number of COVID deaths decreases. This could be related to the fact that the elderly are at higher risk of dying than younger individuals, so the younger the population is (which can be impacted by the number of babies being born), the fewer COVID deaths there are. Perhaps less intuitively, we found that net migration rate was negatively associated with COVID deaths, meaning that the larger the number of people entering a state, the fewer the COVID deaths. Again, counter-intuitively, this may suggest that eliminating state policies that restrict people’s movement could possibly reduce the COVID-19 death rate. Finally, our model indicates that a higher percentage of residents with higher education is associated with decreased COVID-19 deaths; this could point to yet another motivation for government-level interventions that improve public education and make higher education more affordable.
We could improve our model by including factors concerning the outdoor environment and lifestyle of people in the counties, such as level of air pollution and geographical landscape. This is because the county classification information that we have is not enough to infer about general lifestyle of the population at the county, which could contribute to COVID case count and thus COVID death rate. During the period captured by this data, COVID vaccines were not yet readily available to the general population. However, models with more recent data should incorporate vaccination percentages as vaccination has been shown to reduce the severity of illness. We could also consider the quality of clinical care at each county, which could directly impact the COVID death rate. In particular, factors which may be associated with COVID-19 survival is access to testing (to ensure early treatment), nurse to patient ratio, and availability of hospital beds and intensive interventions such as ventilators.
softImpute.

It is possible that our results would have been different had we imputed missing values rather than dropping cases with missing data. However, we had minimal missing data and only dropped 268 observations, or 8% of the total data. In general, more than 10% missingness is considered cause for alarm. With that said, it is always important to consider why data points might be missing, because even small amounts of missing data can bias results if the missingness is informative.
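A hedged sketch of how softImpute could be used here (covid_numeric is a hypothetical numeric covariate matrix; lambda and rank.max are illustrative, untuned values):

```r
# Sketch only: assumes the softImpute package; covid_numeric is an
# assumed name for a numeric covariate matrix containing NAs.
library(softImpute)
X    <- as.matrix(covid_numeric)
fit  <- softImpute(X, rank.max = 10, lambda = 30, type = "als")
Ximp <- complete(X, fit)   # NAs replaced by low-rank estimates
```

In practice one would tune lambda (and possibly rank.max) before refitting the regression on the imputed data.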
Please summarize this project as follows (no more than one page):
The goal of this COVID study is to investigate possible factors that relate to death rate on a county level and determine the ones that are most significant.
Our data was gathered from different sources including the US Department of Agriculture (USDA) and The New York Times (NYTimes). The first source contains socioeconomic and demographic information from all US counties, and was downloaded directly from the USDA website. The second source contains cumulative statistics for COVID cases and deaths across 397 days among the counties, which was shared and downloaded from NYTimes GitHub page.
Data preparation was performed by an experienced data management team using R. First, relevant data points were extracted from the datasets and variables were re-formatted to ensure that formats matched between data sources. Missing values in the COVID dataset were assumed to correspond to counties with no confirmed cases/deaths, and were therefore replaced with “0.” The data was then cleaned to ensure that the number of cumulative cases remained the same or increased over time, and never decreased. Finally, the two datasets were merged using the identifying FIPS codes.
Analysis began with descriptive statistics and visualization of the data through the creation of case and fatality heatmaps by state. The preliminary analysis indicated clear differences in both case and fatality rates by state, and we therefore determined that state was an important factor which should be included in all final models. We then performed LASSO to choose a parsimonious model with all available sensible county-level information. Lambda was set at 0.0055, determined by k-fold cross-validation. The final variables were selected through fine-tuning of the model using Mallow’s Cp. All variables found to be significant after application of Mallow’s Cp were then used to fit a linear regression with log(deaths per 100,000 + 1) as the outcome. This final model consisted of 60 variables, including 48 indicator variables for state plus percentage of residents living below the poverty line, unemployment rate as of 2019, percent of employees working in construction, percent of employees working in agriculture, percent of residents aged 65 or older, percent of residents achieving some college or advanced degrees, the net migration rate, the natural change rate, percent of residents identifying as Hispanic, and the rural-urban continuum code as of 2013.
Based on our final model, we found that the net migration and natural population change rates from 2010 to 2019 contribute negatively to the COVID-19 death rate, perhaps counter-intuitively suggesting that states which did not restrict people’s movement had lower COVID-19 death rates. It was also noted that for each percentage-point increase of residents working in construction and agriculture, COVID death rates were approximately 4.9% and 3.7% lower, respectively. This is possibly due to the outdoor nature of these professions and the fact that COVID-19 spreads less easily when people are spread apart and outdoors. Finally, we found that states with higher levels of poverty experienced greater COVID deaths, as did states with higher percentages of elderly and Hispanic individuals. Even after accounting for these sociodemographic differences, some states experienced significantly different death rates than others, indicating there are additional factors impacting COVID death rates which were not captured in our model.
This study has several strengths. First, it incorporates a vast amount of data from well-respected and validated sources, encompassing all counties in the continental US. We also leveraged state-of-the-art analytic methods to arrive at an impactful and parsimonious model. It is, however, subject to some limitations. There are additional factors we could consider to improve the model: for example, outdoor-environment factors such as the level of air pollution, the lifestyle of people in each county, and the quality of clinical care could help explain the disproportionate COVID death rates among different age and racial groups in our data. This analysis also captures a cross-sectional effect rather than a temporal one. Finally, in applying county-level data we assume the data are representative of any given individual in the county; if this assumption is violated (the ecological fallacy), our results may be biased.
A detailed summary of the variables in each data set follows:
Income: Poverty level and household income
Jobs: Employment type, rate, and change
UnempRate2007-2019: Unemployment rate, 2007-2019
NumEmployed2007-2019: Employed, 2007-2019
NumUnemployed2007-2019: Unemployed, 2007-2019
PctEmpChange1019: Percent employment change, 2010-19
PctEmpChange1819: Percent employment change, 2018-19
PctEmpChange0719: Percent employment change, 2007-19
PctEmpChange0710: Percent employment change, 2007-10
NumCivEmployed: Civilian employed population 16 years and over, 2014-18
NumCivLaborforce2007-2019: Civilian labor force, 2007-2019
PctEmpFIRE: Percent of the civilian labor force 16 and over employed in finance and insurance, and real estate and rental and leasing, 2014-18
PctEmpConstruction: Percent of the civilian labor force 16 and over employed in construction, 2014-18
PctEmpTrans: Percent of the civilian labor force 16 and over employed in transportation, warehousing and utilities, 2014-18
PctEmpMining: Percent of the civilian labor force 16 and over employed in mining, quarrying, oil and gas extraction, 2014-18
PctEmpTrade: Percent of the civilian labor force 16 and over employed in wholesale and retail trade, 2014-18
PctEmpInformation: Percent of the civilian labor force 16 and over employed in information services, 2014-18
PctEmpAgriculture: Percent of the civilian labor force 16 and over employed in agriculture, forestry, fishing, and hunting, 2014-18
PctEmpManufacturing: Percent of the civilian labor force 16 and over employed in manufacturing, 2014-18
PctEmpServices: Percent of the civilian labor force 16 and over employed in services, 2014-18
PctEmpGovt: Percent of the civilian labor force 16 and over employed in public administration, 2014-18
People: Population size, density, education level, race, age, household size, and migration rates
PopDensity2010: Population density, 2010
LandAreaSQMiles2010: Land area in square miles, 2010
TotalHH: Total number of households, 2014-18
TotalOccHU: Total number of occupied housing units, 2014-18
AvgHHSize: Average household size, 2014-18
OwnHomeNum: Number of owner occupied housing units, 2014-18
OwnHomePct: Percent of owner occupied housing units, 2014-18
NonEnglishHHPct: Percent of non-English speaking households of total households, 2014-18
HH65PlusAlonePct: Percent of persons 65 or older living alone, 2014-18
FemaleHHPct: Percent of female headed family households of total households, 2014-18
FemaleHHNum: Number of female headed family households, 2014-18
NonEnglishHHNum: Number of non-English speaking households, 2014-18
HH65PlusAloneNum: Number of persons 65 years or older living alone, 2014-18
Age65AndOlderPct2010: Percent of population 65 or older, 2010
Age65AndOlderNum2010: Population 65 years or older, 2010
TotalPop25Plus: Total population 25 and older, 2014-18 - 5-year average
Under18Pct2010: Percent of population under age 18, 2010
Under18Num2010: Population under age 18, 2010
Ed1LessThanHSPct: Percent of persons with no high school diploma or GED, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyPct: Percent of persons with a high school diploma or GED only, adults 25 and over, 2014-18
Ed3SomeCollegePct: Percent of persons with some college experience, adults 25 and over, 2014-18
Ed4AssocDegreePct: Percent of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusPct: Percent of persons with a 4-year college degree or more, adults 25 and over, 2014-18
Ed1LessThanHSNum: No high school, adults 25 and over, 2014-18
Ed2HSDiplomaOnlyNum: High school only, adults 25 and over, 2014-18
Ed3SomeCollegeNum: Some college experience, adults 25 and over, 2014-18
Ed4AssocDegreeNum: Number of persons with an associate’s degree, adults 25 and over, 2014-18
Ed5CollegePlusNum: College degree 4-years or more, adults 25 and over, 2014-18
ForeignBornPct: Percent of total population foreign born, 2014-18
ForeignBornEuropePct: Percent of persons born in Europe, 2014-18
ForeignBornMexPct: Percent of persons born in Mexico, 2014-18
ForeignBornCentralSouthAmPct: Percent of persons born in Central or South America, 2014-18
ForeignBornAsiaPct: Percent of persons born in Asia, 2014-18
ForeignBornCaribPct: Percent of persons born in the Caribbean, 2014-18
ForeignBornAfricaPct: Percent of persons born in Africa, 2014-18
ForeignBornNum: Number of people foreign born, 2014-18
ForeignBornCentralSouthAmNum: Number of persons born in Central or South America, 2014-18
ForeignBornEuropeNum: Number of persons born in Europe, 2014-18
ForeignBornMexNum: Number of persons born in Mexico, 2014-18
ForeignBornAfricaNum: Number of persons born in Africa, 2014-18
ForeignBornAsiaNum: Number of persons born in Asia, 2014-18
ForeignBornCaribNum: Number of persons born in the Caribbean, 2014-18
Net_International_Migration_Rate_2010_2019: Net international migration rate, 2010-19
Net_International_Migration_2010_2019: Net international migration, 2010-19
Net_International_Migration_2000_2010: Net international migration, 2000-10
Immigration_Rate_2000_2010: Net international migration rate, 2000-10
NetMigrationRate0010: Net migration rate, 2000-10
NetMigrationRate1019: Net migration rate, 2010-19
NetMigrationNum0010: Net migration, 2000-10
NetMigration1019: Net Migration, 2010-19
NaturalChangeRate1019: Natural population change rate, 2010-19
NaturalChangeRate0010: Natural population change rate, 2000-10
NaturalChangeNum0010: Natural change, 2000-10
NaturalChange1019: Natural population change, 2010-19
TotalPop2010: Population size 4/1/2010 Census
TotalPopEst2010: Population size 7/1/2010
TotalPopEst2011: Population size 7/1/2011
TotalPopEst2012: Population size 7/1/2012
TotalPopEst2013: Population size 7/1/2013
TotalPopEst2014: Population size 7/1/2014
TotalPopEst2015: Population size 7/1/2015
TotalPopEst2016: Population size 7/1/2016
TotalPopEst2017: Population size 7/1/2017
TotalPopEst2018: Population size 7/1/2018
TotalPopEst2019: Population size 7/1/2019
TotalPopACS: Total population, 2014-18 - 5-year average
TotalPopEstBase2010: County Population estimate base 4/1/2010
NonHispanicAsianPopChangeRate0010: Population change rate Non-Hispanic Asian, 2000-10
PopChangeRate1819: Population change rate, 2018-19
PopChangeRate1019: Population change rate, 2010-19
PopChangeRate0010: Population change rate, 2000-10
NonHispanicNativeAmericanPopChangeRate0010: Population change rate Non-Hispanic Native American, 2000-10
HispanicPopChangeRate0010: Population change rate Hispanic, 2000-10
MultipleRacePopChangeRate0010: Population change rate multiple race, 2000-10
NonHispanicWhitePopChangeRate0010: Population change rate Non-Hispanic White, 2000-10
NonHispanicBlackPopChangeRate0010: Population change rate Non-Hispanic African American, 2000-10
MultipleRacePct2010: Percent multiple race, 2010
WhiteNonHispanicPct2010: Percent Non-Hispanic White, 2010
NativeAmericanNonHispanicPct2010: Percent Non-Hispanic Native American, 2010
BlackNonHispanicPct2010: Percent Non-Hispanic African American, 2010
AsianNonHispanicPct2010: Percent Non-Hispanic Asian, 2010
HispanicPct2010: Percent Hispanic, 2010
MultipleRaceNum2010: Population size multiple race, 2010
WhiteNonHispanicNum2010: Population size Non-Hispanic White, 2010
BlackNonHispanicNum2010: Population size Non-Hispanic African American, 2010
NativeAmericanNonHispanicNum2010: Population size Non-Hispanic Native American, 2010
AsianNonHispanicNum2010: Population size Non-Hispanic Asian, 2010
HispanicNum2010: Population size Hispanic, 2010
## County classifications
Type of county (rural or urban on a rural-urban continuum scale)
Type_2015_Recreation_NO: Recreation counties, 2015 edition
Type_2015_Farming_NO: Farming-dependent counties, 2015 edition
Type_2015_Mining_NO: Mining-dependent counties, 2015 edition
Type_2015_Government_NO: Federal/State government-dependent counties, 2015 edition
Type_2015_Update: County typology economic types, 2015 edition
Type_2015_Manufacturing_NO: Manufacturing-dependent counties, 2015 edition
Type_2015_Nonspecialized_NO: Nonspecialized counties, 2015 edition
RecreationDependent2000: Nonmetro recreation-dependent, 1997-00
ManufacturingDependent2000: Manufacturing-dependent, 1998-00
FarmDependent2003: Farm-dependent, 1998-00
EconomicDependence2000: Economic dependence, 1998-00
RuralUrbanContinuumCode2003: Rural-urban continuum code, 2003
UrbanInfluenceCode2003: Urban influence code, 2003
RuralUrbanContinuumCode2013: Rural-urban continuum code, 2013
UrbanInfluenceCode2013: Urban influence code, 2013
Noncore2013: Nonmetro noncore, outside Micropolitan and Metropolitan, 2013
Micropolitan2013: Micropolitan, 2013
Nonmetro2013: Nonmetro, 2013
Metro2013: Metro, 2013
Metro_Adjacent2013: Nonmetro, adjacent to metro area, 2013
Noncore2003: Nonmetro noncore, outside Micropolitan and Metropolitan, 2003
Micropolitan2003: Micropolitan, 2003
Metro2003: Metro, 2003
Nonmetro2003: Nonmetro, 2003
NonmetroNotAdj2003: Nonmetro, nonadjacent to metro area, 2003
NonmetroAdj2003: Nonmetro, adjacent to metro area, 2003
Oil_Gas_Change: Change in the value of onshore oil and natural gas production, 2000-11
Gas_Change: Change in the value of onshore natural gas production, 2000-11
Oil_Change: Change in the value of onshore oil production, 2000-11
Hipov: High poverty counties, 2014-18
Perpov_1980_0711: Persistent poverty counties, 2015 edition
PersistentChildPoverty_1980_2011: Persistent child poverty counties, 2015 edition
PersistentChildPoverty2004: Persistent child poverty counties, 2004
PersistentPoverty2000: Persistent poverty counties, 2004
Low_Education_2015_update: Low education counties, 2015 edition
LowEducation2000: Low education, 2000
HiCreativeClass2000: Creative class, 2000
HiAmenity: High natural amenities
RetirementDestination2000: Retirement destination, 1990-00
Low_Employment_2015_update: Low employment counties, 2015 edition
Population_loss_2015_update: Population loss counties, 2015 edition
Retirement_Destination_2015_Update: Retirement destination counties, 2015 edition
The raw data sets are dirty and need to be transformed before we can do our EDA. Because it takes time and effort to clean and merge the different data sources, we provide the final output of the cleaned and merged data. The cleaning procedure is documented below; please read through it to understand what is in the cleaned data. We set `eval = data_cleaned` in the following cleaning chunks so that they only run if any of `data/covid_county.csv`, `data/covid_rates.csv`, or `data/covid_intervention.csv` does not exist.
```r
# Indicator to check whether the data files exist
data_cleaned <- !(file.exists("data/covid_county.csv") &
                    file.exists("data/covid_rates.csv") &
                    file.exists("data/covid_intervention.csv"))
```

We first read in the tables using `data.table::fread()`, as we did last time.
```r
# COVID case/mortality rate data
covid_rates <- fread("data/us_counties.csv", na.strings = c("NA", "", "."))
nyc <- fread("data/nycdata.csv", na.strings = c("NA", "", "."))

# Socioeconomic data
income <- fread("data/income.csv", na.strings = c("NA", "", "."))
jobs <- fread("data/jobs.csv", na.strings = c("NA", "", "."))
people <- fread("data/people.csv", na.strings = c("NA", "", "."))
county_class <- fread("data/county_classifications.csv", na.strings = c("NA", "", "."))

# Intervention policy data
int_dates <- fread("data/intervention_dates.csv", na.strings = c("NA", "", "."))
```

The original NYC data contains more information than we need. We extract only the number of cases and deaths and format it the same as the `covid_rates` data.
```r
# NYC county FIPS matching table
nyc_fips <- data.table(FIPS = c('36005', '36047', '36061', '36081', '36085'),
                       County = c("BX", "BK", "MN", "QN", "SI"))

# NYC cases
nyc_case <- nyc[, .(date = as.Date(date_of_interest, "%m/%d/%Y"),
                    BX = BX_CASE_COUNT,
                    BK = BK_CASE_COUNT,
                    MN = MN_CASE_COUNT,
                    QN = QN_CASE_COUNT,
                    SI = SI_CASE_COUNT)]
nyc_case %<>%
  pivot_longer(cols = BX:SI,
               names_to = "County",
               values_to = "cases") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_cases = cumsum(cases))

# NYC deaths
nyc_death <- nyc[, .(date = as.Date(date_of_interest, "%m/%d/%Y"),
                     BX = BX_DEATH_COUNT,
                     BK = BK_DEATH_COUNT,
                     MN = MN_DEATH_COUNT,
                     QN = QN_DEATH_COUNT,
                     SI = SI_DEATH_COUNT)]
nyc_death %<>%
  pivot_longer(cols = BX:SI,
               names_to = "County",
               values_to = "deaths") %>%
  arrange(date) %>%
  group_by(County) %>%
  mutate(cum_deaths = cumsum(deaths))

# Merge cases and deaths, attach FIPS, and format like covid_rates
nyc_rates <- merge(nyc_case,
                   nyc_death,
                   by = c("date", "County"),
                   all.x = TRUE)
nyc_rates <- merge(nyc_rates,
                   nyc_fips,
                   by = "County")
nyc_rates$State <- "New York"
nyc_rates %<>%
  select(date, FIPS, County, State, cum_cases, cum_deaths) %>%
  arrange(FIPS, date)
```

We only consider cases and deaths in the continental US. Alaska, Hawaii, and Puerto Rico have 02, 15, and 72 as the first two digits of their FIPS codes. We use the `%/%` operator for integer division to extract the first two digits of each FIPS code. We also remove the Virgin Islands and the Northern Mariana Islands. All counties in NYC are aggregated as `County == "New York City"` in `covid_rates` with no FIPS, so we combine the NYC data into `covid_rates`.
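Since `%/%` performs integer division, dividing a numeric FIPS code by 1000 drops the county portion and leaves the two-digit state prefix. A quick illustration (the second FIPS value is an illustrative Alaska-style code, not taken from the data):

```r
# Integer division drops the remainder, leaving the state prefix
36005 %/% 1000  # 36: New York (the Bronx FIPS from nyc_fips)
2290  %/% 1000  # 2: a FIPS starting with 02 (Alaska), so it is filtered out
```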
```r
covid_rates <- covid_rates %>%
  arrange(fips, date) %>%
  filter(!(fips %/% 1000 %in% c(2, 15, 72))) %>%
  filter(county != "New York City") %>%
  filter(!(state %in% c("Virgin Islands", "Northern Mariana Islands"))) %>%
  rename(FIPS = "fips",
         County = "county",
         State = "state",
         cum_cases = "cases",
         cum_deaths = "deaths")
covid_rates$date <- as.Date(covid_rates$date)
covid_rates <- rbind(covid_rates,
                     nyc_rates)
```

We set the week of January 21, 2020 (the first confirmed COVID case in the US) as the first week (2020-01-19 to 2020-01-25).

```r
covid_rates[, week := (interval("2020-01-19", date) %/% weeks(1)) + 1]
```

We then merge the `TotalPopEst2019` variable from the demographic data with `covid_rates` by FIPS.
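The week index above relies on lubridate arithmetic: dividing an interval by `weeks(1)` with `%/%` gives the number of whole weeks elapsed, so adding 1 makes the week containing 2020-01-21 week 1. A quick check on single dates:

```r
library(lubridate)

# 2 days after 2020-01-19: 0 whole weeks elapsed, so week index 1
(interval("2020-01-19", as.Date("2020-01-21")) %/% weeks(1)) + 1  # 1

# 7 days after 2020-01-19: 1 whole week elapsed, so week index 2
(interval("2020-01-19", as.Date("2020-01-26")) %/% weeks(1)) + 1  # 2
```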
```r
covid_rates <- merge(covid_rates[!is.na(FIPS)],
                     people[, .(FIPS = as.character(FIPS),
                                TotalPopEst2019)],
                     by = "FIPS",
                     all.x = TRUE)
```

NA values in the `covid_rates` data set correspond to a county not having confirmed cases/deaths, so we replace the NA values in these columns with zeros. FIPS codes for Kansas City, Missouri, Rhode Island, and some others are missing; we drop them for the moment and output the data up to week 57 as `covid_rates.csv`.

```r
covid_rates$cum_cases[is.na(covid_rates$cum_cases)] <- 0
covid_rates$cum_deaths[is.na(covid_rates$cum_deaths)] <- 0

fwrite(covid_rates %>%
         filter(week < 58) %>%
         arrange(FIPS, date),
       "data/covid_rates.csv")
```

We convert the columns representing dates in `int_dates` to R `Date` objects using `as.Date()`, specifying `origin = "0001-01-01"`. We output the data as `covid_intervention.csv`.
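The `origin` argument matters because `as.Date()` interprets a numeric value as a count of days since that origin. A minimal illustration (the numeric inputs here are arbitrary, chosen only to show the conversion):

```r
# A numeric value is read as days elapsed since `origin`;
# with origin "0001-01-01", day 0 is January 1 of year 1
as.Date(0, origin = "0001-01-01")    # the origin itself
as.Date(365, origin = "0001-01-01")  # one (non-leap) year later
```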
```r
int_dates <- int_dates[-1, ]
date_cols <- names(int_dates)[-(1:3)]
int_dates[, (date_cols) := lapply(.SD, as.Date, origin = "0001-01-01"),
          .SDcols = date_cols]
fwrite(int_dates, "data/covid_intervention.csv")
```

Finally, we merge the demographic data sets by FIPS and output the result as `covid_county.csv`.

```r
countydata <-
  merge(x = income,
        y = merge(x = people,
                  y = jobs,
                  by = c("FIPS", "State", "County")),
        by = c("FIPS", "State", "County"),
        all = TRUE)
countydata <-
  merge(x = countydata,
        y = county_class %>% rename(FIPS = FIPStxt),
        by = c("FIPS", "State", "County"),
        all = TRUE)

# Check dimensions: they are now 3279 x 208
dim(countydata)

fwrite(countydata, "data/covid_county.csv")
```